﻿<!DOCTYPE html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Pandas之旅(七) 谁说pandas慢</title>
  <link rel="stylesheet" href="https://stackedit.io/style.css" />
</head>

<body class="stackedit">
  <div class="stackedit__left">
    <div class="stackedit__toc">
      
<ul>
<li>
<ul>
<li><a href="#pandas-加速">Pandas 加速</a></li>
<li><a href="#使用datetime类型来处理和时间序列有关的数据">1. 使用datetime类型来处理和时间序列有关的数据</a></li>
<li><a href="#批量计算的技巧">2. 批量计算的技巧</a>
<ul>
<li></li>
</ul>
</li>
<li><a href="#通过hdfstore存储数据节省时间">3. 通过HDFStore存储数据节省时间</a></li>
<li><a href="#源码，相关数据及github地址">4. 源码，相关数据及GitHub地址</a></li>
</ul>
</li>
</ul>

    </div>
  </div>
  <div class="stackedit__right">
    <div class="stackedit__html">
      <h2 id="pandas-加速">Pandas 加速</h2>
<p>大家好，今天我们来看有关pandas加速的小技巧，不知道大家在刚刚接触pandas的时候有没有听过如下的说法</p>
<blockquote>
<p><em><strong>pandas太慢了，运行要等半天</strong></em>**</p>
</blockquote>
<p>其实我想说的是，慢不是pandas的错，大家要知道pandas本身是在Numpy上建立起来的包，在很多情况下是支持向量化运算的，而且还有C的底层设计，所以我今天<br>
主要想从几个方面和大家分享一下pandas加速的小技巧，与往常一样，文章分成四部分，本文结构如下：</p>
<ol>
<li>使用datetime类型来处理和时间序列有关的数据</li>
<li>批量计算的技巧</li>
<li>通过HDFStore存储数据节省时间</li>
<li>源码，相关数据及GitHub地址</li>
</ol>
<p>现在就让我们开始吧</p>
<h2 id="使用datetime类型来处理和时间序列有关的数据">1. 使用datetime类型来处理和时间序列有关的数据</h2>
<p>首先这里我们使用的数据源是一个电力消耗情况的数据(energy_cost.csv)，非常贴近生活而且也是和时间息息相关的，用来做测试在合适不过了，这个csv文件大家可以在第四部分找到下载的地方哈</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">import</span> os
<span class="token comment"># 这两行仅仅是切换路径，方便我上传Github，大家不用理会,只要确认csv文件和py文件再一起就行啦</span>
os<span class="token punctuation">.</span>chdir<span class="token punctuation">(</span><span class="token string">"F:\\Python教程\\segmentfault\\pandas_share\\Pandas之旅_07 谁说pandas慢"</span><span class="token punctuation">)</span>
</code></pre>
<p>现在让我们看看数据大概长什么样子</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">import</span> numpy <span class="token keyword">as</span> np
<span class="token keyword">import</span> pandas <span class="token keyword">as</span> pd
f<span class="token string">"Using {pd.__name__},{pd.__version__}"</span>
</code></pre>
<pre><code>'Using pandas,0.23.0'
</code></pre>
<pre class=" language-python"><code class="prism  language-python">df <span class="token operator">=</span> pd<span class="token punctuation">.</span>read_csv<span class="token punctuation">(</span><span class="token string">'energy_cost.csv'</span><span class="token punctuation">,</span>sep<span class="token operator">=</span><span class="token string">','</span><span class="token punctuation">)</span>
df<span class="token punctuation">.</span>head<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<table border="1" class="dataframe">
  <thead>
    <tr>
      <th></th>
      <th>date_time</th>
      <th>energy_kwh</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2001/1/13 0:00</td>
      <td>0.586</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2001/1/13 1:00</td>
      <td>0.580</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2001/1/13 2:00</td>
      <td>0.572</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2001/1/13 3:00</td>
      <td>0.596</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2001/1/13 4:00</td>
      <td>0.592</td>
    </tr>
  </tbody>
</table>
<p>现在我们看到初始数据的样子了，主要有date_time和energy_kwh这两列，来表示时间和消耗的电力，比较好理解，下面让我们来看一下数据类型</p>
<pre class=" language-python"><code class="prism  language-python">df<span class="token punctuation">.</span>dtypes
<span class="token operator">&gt;&gt;</span><span class="token operator">&gt;</span> date_time      <span class="token builtin">object</span>
    energy_kwh    float64
    dtype<span class="token punctuation">:</span> <span class="token builtin">object</span>
</code></pre>
<pre class=" language-python"><code class="prism  language-python"><span class="token builtin">type</span><span class="token punctuation">(</span>df<span class="token punctuation">.</span>iat<span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">,</span><span class="token number">0</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
<span class="token operator">&gt;&gt;</span><span class="token operator">&gt;</span> <span class="token builtin">str</span>
</code></pre>
<p>这里有个小问题，Pandas和NumPy有dtypes（数据类型）的概念。如果未指定参数，则date_time这一列的数据类型默认object，所以为了之后运算方便，我们可以把str类型的这一列转化为timestamp类型:</p>
<pre class=" language-python"><code class="prism  language-python">df<span class="token punctuation">[</span><span class="token string">'date_time'</span><span class="token punctuation">]</span> <span class="token operator">=</span> pd<span class="token punctuation">.</span>to_datetime<span class="token punctuation">(</span>df<span class="token punctuation">[</span><span class="token string">'date_time'</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
df<span class="token punctuation">.</span>dtypes

<span class="token operator">&gt;&gt;</span><span class="token operator">&gt;</span> date_time     datetime64<span class="token punctuation">[</span>ns<span class="token punctuation">]</span>
    energy_kwh           float64
    dtype<span class="token punctuation">:</span> <span class="token builtin">object</span>
</code></pre>
<p>先在大家可以发现我们通过用pd.to_datetime这个方法已经成功的把date_time这一列转化为了datetime64类型</p>
<pre class=" language-python"><code class="prism  language-python">df<span class="token punctuation">.</span>head<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<table border="1" class="dataframe">
  <thead>
    <tr>
      <th></th>
      <th>date_time</th>
      <th>energy_kwh</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>2001-01-13 00:00:00</td>
      <td>0.586</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2001-01-13 01:00:00</td>
      <td>0.580</td>
    </tr>
    <tr>
      <th>2</th>
      <td>2001-01-13 02:00:00</td>
      <td>0.572</td>
    </tr>
    <tr>
      <th>3</th>
      <td>2001-01-13 03:00:00</td>
      <td>0.596</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2001-01-13 04:00:00</td>
      <td>0.592</td>
    </tr>
  </tbody>
</table>
<p>现在再来看数据, 发现已经和刚才不同了,我们还可以通过指定format参数实现一样的效果，速度上也会快一些</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token operator">%</span><span class="token operator">%</span>timeit <span class="token operator">-</span>n <span class="token number">10</span>
<span class="token keyword">def</span> <span class="token function">convert_with_format</span><span class="token punctuation">(</span>df<span class="token punctuation">,</span> column_name<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">return</span> pd<span class="token punctuation">.</span>to_datetime<span class="token punctuation">(</span>df<span class="token punctuation">[</span>column_name<span class="token punctuation">]</span><span class="token punctuation">,</span><span class="token builtin">format</span><span class="token operator">=</span><span class="token string">'%Y/%m/%d %H:%M'</span><span class="token punctuation">)</span>

df<span class="token punctuation">[</span><span class="token string">'date_time'</span><span class="token punctuation">]</span><span class="token operator">=</span>convert_with_format<span class="token punctuation">(</span>df<span class="token punctuation">,</span> <span class="token string">'date_time'</span><span class="token punctuation">)</span>

<span class="token operator">&gt;&gt;</span><span class="token operator">&gt;</span><span class="token number">722</span> µs ± <span class="token number">334</span> µs per loop <span class="token punctuation">(</span>mean ± std<span class="token punctuation">.</span> dev<span class="token punctuation">.</span> of <span class="token number">7</span> runs<span class="token punctuation">,</span> <span class="token number">10</span> loops each<span class="token punctuation">)</span>
</code></pre>
<p>有关具体的日期自定义相关方法，大家点击<a href="http://strftime.org/">这里</a>查看</p>
<h2 id="批量计算的技巧">2. 批量计算的技巧</h2>
<p>首先，我们假设根据用电的时间段不同，电费价目表如下：</p>

<table>
<thead>
<tr>
<th>Type</th>
<th>cents/kwh</th>
<th>periode</th>
</tr>
</thead>
<tbody>
<tr>
<td>Peak</td>
<td>28</td>
<td>17:00 to 24:00</td>
</tr>
<tr>
<td>Shoulder</td>
<td>20</td>
<td>7:00 to 17:00</td>
</tr>
<tr>
<td>Off-Peak</td>
<td>12</td>
<td>0:00 to 7:00</td>
</tr>
</tbody>
</table><p>假设我们想要计算出电费，我们可以先写出一个根据时间动态计算电费的方法“apply_tariff“</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token keyword">def</span> <span class="token function">apply_tariff</span><span class="token punctuation">(</span>kwh<span class="token punctuation">,</span> hour<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token triple-quoted-string string">"""Calculates cost of electricity for given hour."""</span>    
    <span class="token keyword">if</span> <span class="token number">0</span> <span class="token operator">&lt;=</span> hour <span class="token operator">&lt;</span> <span class="token number">7</span><span class="token punctuation">:</span>
        rate <span class="token operator">=</span> <span class="token number">12</span>
    <span class="token keyword">elif</span> <span class="token number">7</span> <span class="token operator">&lt;=</span> hour <span class="token operator">&lt;</span> <span class="token number">17</span><span class="token punctuation">:</span>
        rate <span class="token operator">=</span> <span class="token number">20</span>
    <span class="token keyword">elif</span> <span class="token number">17</span> <span class="token operator">&lt;=</span> hour <span class="token operator">&lt;</span> <span class="token number">24</span><span class="token punctuation">:</span>
        rate <span class="token operator">=</span> <span class="token number">28</span>
    <span class="token keyword">else</span><span class="token punctuation">:</span>
        <span class="token keyword">raise</span> ValueError<span class="token punctuation">(</span>f<span class="token string">'Invalid hour: {hour}'</span><span class="token punctuation">)</span>
    <span class="token keyword">return</span> rate <span class="token operator">*</span> kwh
</code></pre>
<p>好啦，现在我们想要在数据中新增一列 ‘cost_cents’ 来表示总价钱，我们有很多选择，首先能想到的方法便是iterrows（），它可以让我们循环遍历Dataframe的每一行，根据条件计算并赋值给新增的‘cost_cents’列</p>
<h4 id="iterrows"><em><strong>iterrows()</strong></em></h4>
<p>首先我们能做的是循环遍历流程，让我们先用.iterrows()替代上面的方法来试试：</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token operator">%</span><span class="token operator">%</span>timeit <span class="token operator">-</span>n <span class="token number">10</span>
<span class="token keyword">def</span> <span class="token function">apply_tariff_iterrows</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span><span class="token punctuation">:</span>
    energy_cost_list <span class="token operator">=</span> <span class="token punctuation">[</span><span class="token punctuation">]</span>
    <span class="token keyword">for</span> index<span class="token punctuation">,</span> row <span class="token keyword">in</span> df<span class="token punctuation">.</span>iterrows<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token comment"># Get electricity used and hour of day</span>
        energy_used <span class="token operator">=</span> row<span class="token punctuation">[</span><span class="token string">'energy_kwh'</span><span class="token punctuation">]</span>
        hour <span class="token operator">=</span> row<span class="token punctuation">[</span><span class="token string">'date_time'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>hour
        <span class="token comment"># Append cost list</span>
        energy_cost <span class="token operator">=</span> apply_tariff<span class="token punctuation">(</span>energy_used<span class="token punctuation">,</span> hour<span class="token punctuation">)</span>
        energy_cost_list<span class="token punctuation">.</span>append<span class="token punctuation">(</span>energy_cost<span class="token punctuation">)</span>
    df<span class="token punctuation">[</span><span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> energy_cost_list

apply_tariff_iterrows<span class="token punctuation">(</span>df<span class="token punctuation">)</span>
</code></pre>
<pre><code>983 ms ± 65.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
<p>我们为了测试方便，所有的方法都会循环10次来比较耗时，这里很明显我们有很大的改进空间，下面我们用apply方法来优化</p>
<h4 id="apply"><em><strong>apply()</strong></em></h4>
<pre class=" language-python"><code class="prism  language-python"><span class="token operator">%</span><span class="token operator">%</span>timeit <span class="token operator">-</span>n <span class="token number">10</span>
<span class="token keyword">def</span> <span class="token function">apply_tariff_withapply</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span><span class="token punctuation">:</span>
    df<span class="token punctuation">[</span><span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> df<span class="token punctuation">.</span><span class="token builtin">apply</span><span class="token punctuation">(</span>
        <span class="token keyword">lambda</span> row<span class="token punctuation">:</span> apply_tariff<span class="token punctuation">(</span>
            kwh<span class="token operator">=</span>row<span class="token punctuation">[</span><span class="token string">'energy_kwh'</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
            hour<span class="token operator">=</span>row<span class="token punctuation">[</span><span class="token string">'date_time'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>hour<span class="token punctuation">)</span><span class="token punctuation">,</span>
        axis<span class="token operator">=</span><span class="token number">1</span><span class="token punctuation">)</span>

apply_tariff_withapply<span class="token punctuation">(</span>df<span class="token punctuation">)</span>
</code></pre>
<pre><code>247 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
<p>这回速度得到了很大的提升,但是显然我们还没有get到pandas加速的精髓：矢量化操作。下面让我们开始提速</p>
<h4 id="isin"><strong>isin()</strong></h4>
<p>假设我们现在的电价是定值，不根据用电时间段来改变，那么pandas中最快的方法那就是采用(df[‘cost_cents’] = df[‘energy_kwh’] * price)，这就是一个简单的矢量化操作示范。它基本是在Pandas中运行最快的方式。</p>
<p>目前的问题是我们的价格是动态的，那么如何将条件判断添加到Pandas中的矢量化运算中呢？答案就是我们根据条件选择和分组DataFrame，然后对每个选定的组应用矢量化操作:</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment">#先让我们把时间序列作为索引</span>
df<span class="token punctuation">.</span>set_index<span class="token punctuation">(</span><span class="token string">'date_time'</span><span class="token punctuation">,</span> inplace<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">)</span>
</code></pre>
<pre class=" language-python"><code class="prism  language-python"><span class="token operator">%</span><span class="token operator">%</span>timeit <span class="token operator">-</span>n <span class="token number">10</span>
<span class="token keyword">def</span> <span class="token function">apply_tariff_isin</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># Define hour range Boolean arrays</span>
    peak_hours <span class="token operator">=</span> df<span class="token punctuation">.</span>index<span class="token punctuation">.</span>hour<span class="token punctuation">.</span>isin<span class="token punctuation">(</span><span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">24</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    shoulder_hours <span class="token operator">=</span> df<span class="token punctuation">.</span>index<span class="token punctuation">.</span>hour<span class="token punctuation">.</span>isin<span class="token punctuation">(</span><span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">7</span><span class="token punctuation">,</span> <span class="token number">17</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    off_peak_hours <span class="token operator">=</span> df<span class="token punctuation">.</span>index<span class="token punctuation">.</span>hour<span class="token punctuation">.</span>isin<span class="token punctuation">(</span><span class="token builtin">range</span><span class="token punctuation">(</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">7</span><span class="token punctuation">)</span><span class="token punctuation">)</span>

    <span class="token comment"># Apply tariffs to hour ranges</span>
    df<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>peak_hours<span class="token punctuation">,</span> <span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> df<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>peak_hours<span class="token punctuation">,</span> <span class="token string">'energy_kwh'</span><span class="token punctuation">]</span> <span class="token operator">*</span> <span class="token number">28</span>
    df<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>shoulder_hours<span class="token punctuation">,</span><span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> df<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>shoulder_hours<span class="token punctuation">,</span> <span class="token string">'energy_kwh'</span><span class="token punctuation">]</span> <span class="token operator">*</span> <span class="token number">20</span>
    df<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>off_peak_hours<span class="token punctuation">,</span><span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> df<span class="token punctuation">.</span>loc<span class="token punctuation">[</span>off_peak_hours<span class="token punctuation">,</span> <span class="token string">'energy_kwh'</span><span class="token punctuation">]</span> <span class="token operator">*</span> <span class="token number">12</span>

apply_tariff_isin<span class="token punctuation">(</span>df<span class="token punctuation">)</span>
</code></pre>
<pre><code>5.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
<p>这回我们发现速度是真正起飞了，首先我们根据用电的三个时段把df进行分三组，再依次进行三次矢量化操作，大家可以发现最后减少了很多时间，原理很简单：</p>
<p>在运行的时候，.isin（）方法返回一个布尔值数组，如下所示：</p>
<ul>
<li>[False, False, False, …, True, True, True]</li>
</ul>
<p>接下来布尔数组传递给DataFrame的.loc索引器时，我们获得一个仅包含与3个用电时段匹配DataFrame切片。然后简单的进行乘法操作就行了，这样做的好处是我们已经不需要刚才提过的apply方法了，因为不在存在遍历所有行的问题</p>
<h4 id="我们可以做的更好吗？">我们可以做的更好吗？</h4>
<p>通过观察可以发现，在apply_tariff_isin（）中，我们仍然在通过调用df.loc和df.index.hour.isin（）来进行一些“手动工作”。如果想要进一步提速,我们可以使用cut方法</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token operator">%</span><span class="token operator">%</span>timeit <span class="token operator">-</span>n <span class="token number">10</span>
<span class="token keyword">def</span> <span class="token function">apply_tariff_cut</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span><span class="token punctuation">:</span>
    cents_per_kwh <span class="token operator">=</span> pd<span class="token punctuation">.</span>cut<span class="token punctuation">(</span>x<span class="token operator">=</span>df<span class="token punctuation">.</span>index<span class="token punctuation">.</span>hour<span class="token punctuation">,</span>
                           bins<span class="token operator">=</span><span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">,</span> <span class="token number">7</span><span class="token punctuation">,</span> <span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">24</span><span class="token punctuation">]</span><span class="token punctuation">,</span>
                           include_lowest<span class="token operator">=</span><span class="token boolean">True</span><span class="token punctuation">,</span>
                           labels<span class="token operator">=</span><span class="token punctuation">[</span><span class="token number">12</span><span class="token punctuation">,</span> <span class="token number">20</span><span class="token punctuation">,</span> <span class="token number">28</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">.</span>astype<span class="token punctuation">(</span><span class="token builtin">int</span><span class="token punctuation">)</span>
    df<span class="token punctuation">[</span><span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> cents_per_kwh <span class="token operator">*</span> df<span class="token punctuation">[</span><span class="token string">'energy_kwh'</span><span class="token punctuation">]</span>
</code></pre>
<pre><code>140 ns ± 29.9 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
<p>效果依然锋利，速度上有了成倍的提升</p>
<h4 id="不要忘了用numpy">不要忘了用Numpy</h4>
<p>众所周知，Pandas是在Numpy上建立起来的，所以在Numpy中当然有类似cut的方法可以实现分组,从速度上来讲差不太多</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token operator">%</span><span class="token operator">%</span>timeit <span class="token operator">-</span>n <span class="token number">10</span>
<span class="token keyword">def</span> <span class="token function">apply_tariff_digitize</span><span class="token punctuation">(</span>df<span class="token punctuation">)</span><span class="token punctuation">:</span>
    prices <span class="token operator">=</span> np<span class="token punctuation">.</span>array<span class="token punctuation">(</span><span class="token punctuation">[</span><span class="token number">12</span><span class="token punctuation">,</span> <span class="token number">20</span><span class="token punctuation">,</span> <span class="token number">28</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    bins <span class="token operator">=</span> np<span class="token punctuation">.</span>digitize<span class="token punctuation">(</span>df<span class="token punctuation">.</span>index<span class="token punctuation">.</span>hour<span class="token punctuation">.</span>values<span class="token punctuation">,</span> bins<span class="token operator">=</span><span class="token punctuation">[</span><span class="token number">7</span><span class="token punctuation">,</span> <span class="token number">17</span><span class="token punctuation">,</span> <span class="token number">24</span><span class="token punctuation">]</span><span class="token punctuation">)</span>
    df<span class="token punctuation">[</span><span class="token string">'cost_cents'</span><span class="token punctuation">]</span> <span class="token operator">=</span> prices<span class="token punctuation">[</span>bins<span class="token punctuation">]</span> <span class="token operator">*</span> df<span class="token punctuation">[</span><span class="token string">'energy_kwh'</span><span class="token punctuation">]</span><span class="token punctuation">.</span>values

</code></pre>
<pre><code>54.9 ns ± 19.3 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
<p>正常情况下，以上的加速方法是能满足日常需要的，如果有特殊的需求，大家可以上网看看有没有相关的第三方加速包</p>
<h2 id="通过hdfstore存储数据节省时间">3. 通过HDFStore存储数据节省时间</h2>
<p>这里主要想强调的是节省预处理的时间，假设我们辛辛苦苦搭建了一些模型，但是每次运行之前都要进行一些预处理，比如类型转换，用时间序列做索引等，如果不用HDFStore的话每次都会花去不少时间，这里Python提供了一种解决方案，可以把经过预处理的数据存储为HDF5格式，方便我们下次运行时直接调用。</p>
<p>下面就让我们把本篇文章的df通过HDF5来存储一下：</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment"># Create storage object with filename `processed_data`</span>
data_store <span class="token operator">=</span> pd<span class="token punctuation">.</span>HDFStore<span class="token punctuation">(</span><span class="token string">'processed_data.h5'</span><span class="token punctuation">)</span>

<span class="token comment"># Put DataFrame into the object setting the key as 'preprocessed_df'</span>
data_store<span class="token punctuation">[</span><span class="token string">'preprocessed_df'</span><span class="token punctuation">]</span> <span class="token operator">=</span> df
data_store<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<p>现在我们可以关机下班了，当明天接着上班后，通过key（“preprocessed_df”）就可以直接使用经过预处理的数据了</p>
<pre class=" language-python"><code class="prism  language-python"><span class="token comment"># Access data store</span>
data_store <span class="token operator">=</span> pd<span class="token punctuation">.</span>HDFStore<span class="token punctuation">(</span><span class="token string">'processed_data.h5'</span><span class="token punctuation">)</span>

<span class="token comment"># Retrieve data using key</span>
preprocessed_df <span class="token operator">=</span> data_store<span class="token punctuation">[</span><span class="token string">'preprocessed_df'</span><span class="token punctuation">]</span>
data_store<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<pre class=" language-python"><code class="prism  language-python">preprocessed_df<span class="token punctuation">.</span>head<span class="token punctuation">(</span><span class="token punctuation">)</span>
</code></pre>
<table border="1" class="dataframe">
  <thead>
    <tr>
      <th></th>
      <th>energy_kwh</th>
      <th>cost_cents</th>
    </tr>
    <tr>
      <th>date_time</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2001-01-13 00:00:00</th>
      <td>0.586</td>
      <td>7.032</td>
    </tr>
    <tr>
      <th>2001-01-13 01:00:00</th>
      <td>0.580</td>
      <td>6.960</td>
    </tr>
    <tr>
      <th>2001-01-13 02:00:00</th>
      <td>0.572</td>
      <td>6.864</td>
    </tr>
    <tr>
      <th>2001-01-13 03:00:00</th>
      <td>0.596</td>
      <td>7.152</td>
    </tr>
    <tr>
      <th>2001-01-13 04:00:00</th>
      <td>0.592</td>
      <td>7.104</td>
    </tr>
  </tbody>
</table>
<p>如上图所示，现在我们可以发现date_time已经是处理为index了</p>
<h2 id="源码，相关数据及github地址">4. 源码，相关数据及GitHub地址</h2>
<p>这一期为大家分享了一些pandas加速的实用技巧，希望可以帮到各位小伙伴，当然，类似的技巧还有很多，但是核心思想应该一直围绕矢量化操作上，毕竟是基于Numpy上建立的包，如果大家有更好的办法，希望可以在我的文章底下留言哈</p>
<p>我把这一期的ipynb文件，py文件以及我们用到的energy_cost.csv放到了Github上，大家可以点击下面的链接来下载：</p>
<ul>
<li>Github仓库地址： <a href="https://github.com/yaozeliang/pandas_share/tree/master/Pandas%E4%B9%8B%E6%97%85_04%20pandas%E8%B6%85%E5%AE%9E%E7%94%A8%E6%8A%80%E5%B7%A7">https://github.com/yaozeliang/pandas_share</a></li>
</ul>
<p>希望大家能够继续支持我，这一篇文章已经是Pandas系列的最后一篇了，虽然一共只写了7篇文章，但是我认为从实用性上来讲并没有太逊色于收费课程（除了少了很多漂亮的ppt），接下来我会再接再厉，分享一下我对R （ggplot2）或者matplotlib的学习经验！！</p>
<p>Pandas之旅到此结束。撒花</p>

    </div>
  </div>
</body>

</html>
